Abstract. Document networks are characteristic in that a document node, e.g. a
webpage or an article, carries meaningful content. Properties of document networks
are not only affected by topological connectivity between nodes, but also strongly
influenced by the semantic relation between the content of the nodes. We observe that
document networks have a large number of triangles and a high clustering
coefficient, and that there is a strong correlation between the probability of formation of
a triangle and the content similarity among the three nodes involved. We propose the
degree-similarity product (DSP) model, which reproduces these properties well. The
model achieves this by using a preferential attachment mechanism which favours the
linkage between nodes that are both popular and similar. This work is a step
towards a better understanding of the structure and evolution of document networks.



PACS numbers: 89.75.Hc, 05.10.-a, 87.23.Ge, 89.20.Hh



i: Author to whom any correspondence should be addressed. 



Triangular clustering in document networks 






1. Introduction 

In recent years, studying the structure, function and evolution of complex networks in
society and nature has become a major research focus [1, 2, 3, 4]. Examples of complex
networks include the Internet, the World Wide Web, the international aviation network,
social collaborations between groups of people, and protein interactions in a cell, to name
just a few. These networks exhibit a number of interesting properties, such as a short
average distance between a pair of nodes in comparison with the large network size [1], a
clustering structure where one's friends are friends of each other, and a power-law
distribution of the number of connections a node has [2].

This paper concerns one particular type of complex network, the document
network, such as the Web and citation networks. Document networks are
characteristic in that a document node, e.g. a webpage or an article, carries text
or multimedia content. Properties of document networks are not only affected by
topological connectivity between nodes, but also strongly influenced by the semantic relation
between the content of nodes. Research on document networks is relevant to a number
of issues, such as Web navigation and information retrieval [5, 6, 7].

Menczer [8] reported that the probability of linkage between two documents 
increases with the similarity between their content. Based on this observation, he 
proposed the degree-similarity mixture (DSM) model, which successfully reproduces two 
important properties of document networks: the power-law connectivity distribution and 
the increasing linkage probability as a function of content similarity. The DSM model 
remains one of the most advanced models for document networks. 

Recently we reported that document networks exhibit a number of triangular
clustering properties: they have huge numbers of triangles and high
clustering coefficients, and there is a positive relation between the probability of
formation of a triangle and the content similarity among the three documents
involved [9]. Menczer's DSM model focuses on the connectivity and content properties
between two nodes, and it produces only around 5% of the triangles found in real document
networks. There are a number of topology models which can produce networks with both a
power-law distribution of connectivity and a high clustering coefficient, such as the network
model in [10, 11], which is based on the balance between different types of attachment
mechanisms, i.e. cyclic closure and focal closure. This model, however, does not include
document content in its generative mechanisms and cannot reproduce
content-related properties of document networks.

In this paper, we examine and model the triangular clustering properties of
document networks. In Section 2 we first introduce two datasets of real document
networks, then define a number of metrics to quantify connectivity and content
properties, and finally review Menczer's DSM model. In Section 3 we propose our
degree-similarity product (DSP) model, where a node's ability to acquire a new link is
given as a product function of node connectivity and content similarity between nodes.
In Section 4 we evaluate our DSP model against the real data and show that the model
reproduces not only the connectivity and content properties between two nodes, but
also the triangular clustering properties involving three nodes. In Section 5 we conclude
the paper.

Table 1. Evaluation of the degree-similarity mixture (DSM) model and the
degree-similarity product (DSP) model against the WT10g data and the PNAS data,
respectively. Topological properties shown are the number of nodes $N$, the number of
links $L$, the total number of weak triangles $\Delta$ and the average clustering coefficient
$\langle C \rangle$. For each model, ten networks are generated for the WT10g data and the PNAS data
respectively, and results are averaged.

Properties            WT10g        DSM Model          DSP Model
$N$                   50 000       50 000             50 000
$L$                   233 692      233 692            234 020 ± 1 228
$\Delta$              1 266 730    62 503 ± 187       1 233 308 ± 15 467
$\langle C \rangle$   0.153        0.062 ± 0.001      0.121 ± 0.001

Properties            PNAS         DSM Model          DSP Model
$N$                   28 828       28 828             28 828
$L$                   40 610       40 610             40 580 ± 215
$\Delta$              13 544       868 ± 24           13 583 ± 329
$\langle C \rangle$   0.214        0.021 ± 0.0002     0.139 ± 0.001

2. Triangular clustering in document networks 

2.1. Two Datasets 

In this study we examine the following two datasets of real document networks. 

• WT10g data, which is a webpage network where a webpage is a node and
two webpages are connected if there is a hyperlink between them. The
WT10g data were proposed by the annual international Text REtrieval Conference
(http://trec.nist.gov) and are distributed by CSIRO (http://es.csiro.au/TRECWeb).
The data preserve properties of the Web and have been widely used in research on
information modelling and retrieval [12, 13]. The data contain 1.7 million webpages,
the hyperlinks among them and the text content of each webpage. We study
ten randomly sampled subsets of the WT10g data. Each subset contains 50,000
webpages with the URL domain name .com. (A recent study has shown that
subsets sampled from different or mixed domains exhibit similar properties [9].)
Observations in this paper are averaged over the ten subsets.

• PNAS data, which is a citation network where an article is a node and two articles
are linked if they have a citation relation. It contains 28,828 articles published
by the Proceedings of the National Academy of Sciences (PNAS) of the United
States of America from 1998 to 2007. We crawled the data from the journal's website
(http://www.pnas.org) in May 2008 and used each article's title and abstract as
its content.

Figure 1. Linkage probability and triangularity probability for the WT10g webpage
network and the PNAS citation network. The results are compared with Menczer's
DSM model. (a) Linkage probability $P(R)$ as a function of content similarity $R$.
(b) Triangularity probability $P(R^{\Delta})$, on a logarithmic scale, as a function of trilateral
similarity $R^{\Delta}$.

2.2. Triangle and Clustering Coefficient 

Triangle is the basic unit for clustering structure and network redundancy P, [HI [15j [16j 
[17], EE]. Triangle-related properties have been used to quantify network transitivity [H] 
and characterise the structural invariance across web sites [18].

The most widely studied triangle-related property is the clustering coefficient, C, 
which measures how tightly a node's neighbours are interconnected with each other [HH]. 
Clustering coefficient is calculated as the ratio of the number of triangles formed by a 
node and its neighbours to the maximal number of triangles they can have. When C = 1 
a node and its neighbours are fully interconnected and form a clique; and when C = 0
the neighbours do not know each other at all. The average clustering coefficient over all 
nodes measures the level of clustering behaviour in a network. 
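
As an illustration, the clustering coefficient described above can be computed directly from an adjacency structure. The following Python sketch assumes an undirected graph stored as a dictionary of neighbour sets; the function names and the toy graph are hypothetical, and nodes with fewer than two neighbours are assigned C = 0 by convention.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Clustering coefficient of `node` in an undirected graph.

    `adj` maps each node to the set of its neighbours.
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbours, so C = 0
    # Each linked pair of neighbours closes one triangle through `node`.
    triangles = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    # k * (k - 1) / 2 is the maximal number of such triangles.
    return triangles / (k * (k - 1) / 2)


def average_clustering(adj):
    """Average of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Toy graph: a triangle a-b-c plus a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(adj, "a"), average_clustering(adj))
```

In the toy graph, node a has three neighbours of which only one pair is linked, so its coefficient is 1/3, while b and c each sit in a fully connected neighbourhood.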

Note that the number of triangles and the clustering coefficient are not trivially related. As shown in
Table 1, the total number of triangles, $\Delta$, in the WT10g data is almost 100 times
that in the PNAS data. The density of triangles in the WT10g data, measured by $\Delta/N$
or $\Delta/L$, is also many times larger. However, the average clustering coefficient, $\langle C \rangle$, of
the WT10g data is smaller than that of the PNAS data.
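
The comparison above can be checked with a few lines of arithmetic. The sketch below uses the real-data values of N, L, the triangle count and the average clustering coefficient listed in Table 1; the dictionary layout and the helper name are only illustrative.

```python
# Real-data columns of Table 1.
wt10g = {"N": 50_000, "L": 233_692, "triangles": 1_266_730, "C": 0.153}
pnas = {"N": 28_828, "L": 40_610, "triangles": 13_544, "C": 0.214}


def triangle_densities(d):
    """Triangles per node and triangles per link."""
    return d["triangles"] / d["N"], d["triangles"] / d["L"]


ratio = wt10g["triangles"] / pnas["triangles"]
print(f"triangle ratio WT10g/PNAS: {ratio:.1f}")
print("WT10g densities:", triangle_densities(wt10g))
print("PNAS densities: ", triangle_densities(pnas))
print("WT10g has the lower <C>:", wt10g["C"] < pnas["C"])
```

The WT10g sample has roughly 25 triangles per node against fewer than 0.5 for PNAS, yet its average clustering coefficient is the smaller of the two.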

2.3. Content Similarity and Linkage Probability

For a given document network, we collect the keywords present in all documents in the
network and construct a keyword vector space [19, 20]. The content of a document is
then represented as a keyword vector, $X$, which gives the frequency of each keyword's
appearance in the document. The content similarity, or relevance, $R$, between two
documents, $i$ and $j$, is quantified by the cosine of their vectors:

$$R_{ij} = R_{ji} = \frac{X_i \cdot X_j}{\|X_i\| \, \|X_j\|}. \qquad (1)$$

When $R_{ij} = 1$ the contents of the two documents are highly related or similar; when
$R_{ij} = 0$ the two documents have very little in common. The linkage probability, $P(R)$,
is the probability that two nodes with content similarity $R$ are connected in the network.
It is calculated as $P(R) = M^*(R)/M(R)$, where $M(R)$ is the total number of node pairs
(connected or not) whose content similarity is $R$, and $M^*(R)$ is the number of such node
pairs which are actually connected in the network.
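
A minimal Python sketch of these two quantities follows. It assumes documents are already tokenised into keywords and stored as frequency counters; grouping similarity values into equal-width bins is an assumption made here for practicality (the definition above is stated per value of R), and all names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine of two sparse keyword-frequency vectors (Counter objects)."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


def linkage_probability(pairs, n_bins=10):
    """Fraction of linked pairs per similarity bin.

    `pairs` is an iterable of (R, is_linked) tuples, one per node pair.
    Returns {bin_index: M*(R) / M(R)} for the bins that contain any pair.
    """
    total, linked = Counter(), Counter()
    for r, is_linked in pairs:
        b = min(int(r * n_bins), n_bins - 1)  # R = 1 falls in the top bin
        total[b] += 1
        linked[b] += bool(is_linked)
    return {b: linked[b] / total[b] for b in total}


doc_a = Counter("network triangle network".split())
doc_b = Counter("network triangle".split())
doc_c = Counter("protein cell".split())
print(cosine_similarity(doc_a, doc_b))  # shares both keywords with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no keyword with doc_a
```

Documents with no keyword in common get a similarity of exactly zero, which is why a small additive constant is needed in any attachment rule that multiplies by similarity.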

Figure 1(a) shows that in document networks the linkage probability increases with
the content similarity, i.e. the more similar two documents are, the more likely they are connected.
For example, in the PNAS citation network, if two articles have R = 0.5 there is a 50%
chance that they have a citation relation; by comparison, the chance is very low when
R < 0.2.



2.4. Trilateral Similarity and Triangularity Probability

In document networks, if a node is similar to a second node and this second node
is similar to a third node, then the first and third nodes are also likely to be similar. Here we
define a new metric called the trilateral similarity, $R^{\Delta}$, which measures the minimum
content similarity among three nodes. For three document nodes $i$, $j$ and $k$, the trilateral
similarity is the smallest (bilateral) content similarity between each pair of the three
nodes, i.e.

$$R^{\Delta}_{ijk} = \min\{R_{ij}, R_{ik}, R_{jk}\}. \qquad (2)$$

Similarly we define the triangularity probability, $P(R^{\Delta})$, as the probability that three
nodes with the trilateral similarity $R^{\Delta}$ form a triangle. In this study we consider weak
triangles, each of which is a cycle of three nodes with at least one link (in either direction)
between each pair of the three nodes.
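
The two definitions above translate into short functions. In this sketch the pairwise similarities are assumed to be stored in a dictionary keyed by unordered node pairs, and links are directed (source, target) tuples; both layouts and all names are illustrative choices.

```python
def trilateral_similarity(R, i, j, k):
    """Smallest pairwise content similarity among nodes i, j and k."""
    return min(R[frozenset((i, j))], R[frozenset((i, k))], R[frozenset((j, k))])


def is_weak_triangle(links, i, j, k):
    """True if every pair among i, j, k is joined by a link in either direction.

    `links` is a set of directed (source, target) tuples.
    """
    def joined(u, v):
        return (u, v) in links or (v, u) in links

    return joined(i, j) and joined(i, k) and joined(j, k)


# Hypothetical three-document example.
R = {
    frozenset(("a", "b")): 0.6,
    frozenset(("a", "c")): 0.4,
    frozenset(("b", "c")): 0.7,
}
links = {("a", "b"), ("c", "a"), ("b", "c")}
print(trilateral_similarity(R, "a", "b", "c"))
print(is_weak_triangle(links, "a", "b", "c"))
```

Link direction is ignored when testing for a weak triangle, so the mixed directions in the example still count.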

Figure 1(b) shows that the triangularity probability is sensitive to the trilateral
similarity. When the trilateral similarity $R^{\Delta}$ increases from 0.1 to 0.5, the triangularity
probability increases by two orders of magnitude for the WT10g data and by four orders of
magnitude for the PNAS data.

We note that for a given value of content similarity or trilateral similarity, the
cube of the (bilateral) linkage probability provides the lower bound of the triangularity
probability. But these two quantities are not trivially related because the latter is strongly
determined by a network's triangular clustering structure.

Table 2. Parameters used by the two models for the datasets.

DSM Model parameters   WT10g        PNAS
$\alpha$               0.1          0.01
$\gamma$               3.5          3.5

DSP Model parameters   WT10g        PNAS
$\beta_1$              5            7
$\beta_2$              1            4
$\alpha$               $10^{-12}$   $10^{-12}$
$\lambda$              6            8

2.5. Degree-Similarity Mixture (DSM) Model 

The degree-similarity mixture (DSM) model was introduced by Menczer in 2004 [8]. 
The model's generative mechanism incorporates content similarity in the formation of 
document links. At each step, one new document is added and attached by $m = L/N$
new links to existing documents. At time step $t$, the probability that the new document
$t$ is attached to the existing document $i$ is

$$\Pr(i) = \alpha \frac{k_i}{mt} + (1 - \alpha) \overline{\Pr}(i), \qquad \overline{\Pr}(i) \propto \left( \frac{1}{R_{it}} - 1 \right)^{-\gamma}, \qquad (3)$$

where $i < t$; $k_i$ is the number of connections, or degree, of node $i$; $R_{it}$ is calculated from the
document content of the given network; $\gamma$ is a constant which is calculated based on
real data; and $\alpha$ is a preferential attachment parameter. The first term of Equation (3)
favours an old node which is already well connected and the second term favours one
whose content is similar to the new node. The tunable parameter $0 \le \alpha \le 1$ models the
balance between choosing a popular node with large degree and choosing a similar node
with high content similarity.
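
The mixture rule can be sketched in a few lines of Python. This is an illustration only, not Menczer's implementation: it assumes the similarity term is proportional to (1/R - 1)^(-gamma), normalised over the existing nodes, and it clips R into the open interval (0, 1) with a hypothetical `eps` so that the expression stays finite.

```python
def dsm_probabilities(degrees, similarities, alpha, gamma, m, t, eps=1e-9):
    """Attachment probability of the new document t to each existing document.

    degrees[i]      -- current degree k_i of existing node i
    similarities[i] -- content similarity between node i and the new document
    m, t            -- links added per step and the current time step
    """
    # Similarity term, assumed proportional to (1/R - 1)^(-gamma).
    raw = []
    for r in similarities:
        r = min(max(r, eps), 1 - eps)  # keep R strictly inside (0, 1)
        raw.append((1 / r - 1) ** -gamma)
    norm = sum(raw)
    # Mixture: alpha weights the degree term, (1 - alpha) the similarity term.
    return [
        alpha * k / (m * t) + (1 - alpha) * s / norm
        for k, s in zip(degrees, raw)
    ]


# Two existing nodes: one well connected but dissimilar, one similar.
by_similarity = dsm_probabilities([3, 1], [0.2, 0.8], alpha=0.0, gamma=3.5, m=2, t=2)
by_degree = dsm_probabilities([3, 1], [0.2, 0.8], alpha=1.0, gamma=3.5, m=2, t=2)
print(by_similarity, by_degree)
```

Because the two terms are added, a node can score highly on either criterion alone, which is the behaviour the product rule of the next section is designed to avoid.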

For each of the two document networks under study, we use the DSM model to
grow ten networks to the same size as the real network, and results are averaged over
the ten networks (see Table 1). Table 2 gives the model parameters, which are obtained,
as Menczer [8] did, by best fitting. Menczer has shown that the DSM model is able to
reproduce the degree distribution of document networks. Figure 1(a) shows that the DSM
model also produces a sound prediction of the relation between linkage probability and
content similarity.

In terms of triangular clustering properties, however, Table 1 shows that the model
produces only around 5% of the total number of triangles contained in the real networks
and underestimates the average clustering coefficient of the networks. Figure 1(b)
shows that the model also significantly underestimates the correlation between triangularity
probability and trilateral similarity.

3. Degree-Similarity Product (DSP) model

In this paper we introduce a new generative model for document networks, which we call the
degree-similarity product (DSP) model. Our model is partially inspired by the multi-component
graph growing models of [21, 22]. The model starts from an initial seed of a
pair of linked nodes. At each time step, one of the following two actions is taken:

• Growth: with probability $p$, a new isolated node is introduced to the network.
Parameter $p$ is a constant given by the numbers of nodes and links of the
generated network, $p = N/(N + L)$, and determines the average node degree of the
generated network, i.e. $\langle k \rangle = 2L/N = 2(1 - p)/p$.

• DSP preferential attachment: with probability $(1 - p)$, a new link is attached
between two nodes. The link starts from node $i$ and ends at node $j$. The two nodes
are chosen by the following preferential probabilities:

$$\Pi(i) = \frac{k_i^{\mathrm{out}} + \beta_1}{\sum_{m} (k_m^{\mathrm{out}} + \beta_1)}, \qquad (4)$$

$$\Pi(j) = \frac{(k_j^{\mathrm{in}} + \beta_2)(R_{ij}^{\lambda} + \alpha)}{\sum_{l} \left[ (k_l^{\mathrm{in}} + \beta_2)(R_{il}^{\lambda} + \alpha) \right]}, \qquad (5)$$

where $k_i^{\mathrm{out}}$ is the out-degree of node $i$, $k_j^{\mathrm{in}}$ is the in-degree of node $j$, and $m$ and $l$ run over
all existing nodes, $l \neq i$. The content similarity $R_{ij}$ is calculated from the document
content of the given network. Parameters $\beta_1$, $\beta_2$, $\alpha$ and $\lambda$ all take positive values.
$\beta_1$ and $\beta_2$ give nodes with $k^{\mathrm{out}} = 0$ or $k^{\mathrm{in}} = 0$, respectively, an initial ability to
acquire links. $\alpha$ ensures that even very different documents (with $R \approx 0$) still have
a chance to link with each other. $\lambda$ tunes the weight of the content similarity in
choosing a link's ending node.
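
The two actions above can be sketched as a small simulation. The code below is a sketch under stated assumptions rather than the authors' implementation: the starting node is drawn with weight (out-degree + beta1), the ending node with weight (in-degree + beta2) times (R^lambda + alpha), self-loops and duplicate links are avoided by removing them from the candidate set, and each action is simply skipped once its target count is reached. The `similarity` callback and all other names are hypothetical.

```python
import random


def grow_dsp(n_nodes, n_links, similarity, beta1, beta2, alpha, lam, seed=0):
    """Grow a directed network with the two DSP actions.

    similarity(i, j) must return the content similarity of nodes i and j
    in [0, 1]. Returns (number_of_nodes, set_of_directed_links).
    """
    if n_nodes < 2 or not 1 <= n_links <= n_nodes * (n_nodes - 1):
        raise ValueError("target size cannot be reached without duplicates")
    rng = random.Random(seed)
    p = n_nodes / (n_nodes + n_links)  # growth probability
    k_out, k_in = [1, 0], [0, 1]       # seed: node 0 links to node 1
    links = {(0, 1)}
    while len(k_out) < n_nodes or len(links) < n_links:
        if rng.random() < p:
            # Growth: add an isolated node (skipped once the target is met).
            if len(k_out) < n_nodes:
                k_out.append(0)
                k_in.append(0)
            continue
        if len(links) >= n_links:
            continue
        nodes = range(len(k_out))
        # Starting node i, weighted by out-degree plus beta1.
        i = rng.choices(nodes, weights=[k_out[m] + beta1 for m in nodes])[0]
        # Ending node j, weighted by the degree-similarity product.
        candidates = [l for l in nodes if l != i and (i, l) not in links]
        if not candidates:
            continue
        weights = [
            (k_in[l] + beta2) * (similarity(i, l) ** lam + alpha)
            for l in candidates
        ]
        j = rng.choices(candidates, weights=weights)[0]
        links.add((i, j))
        k_out[i] += 1
        k_in[j] += 1
    return len(k_out), links


# Toy run with a constant similarity of 0.5 between every pair of nodes.
n, links = grow_dsp(30, 60, lambda i, j: 0.5, beta1=5, beta2=1,
                    alpha=1e-12, lam=6, seed=1)
print(n, len(links), 2 * len(links) / n)
```

With 30 nodes and 60 links the growth probability is p = 1/3, and the mean degree 2L/N = 4 agrees with 2(1 - p)/p.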

It is notable that Equation (5) is a product function of degree and content similarity.
This ensures that links are preferentially attached between nodes which are both popular 
and similar. As shown in the following section, this mechanism effectively increases the 
chance of forming triangles among similar nodes. 
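
A small numerical example may help to show why a product favours nodes that are both popular and similar. The sketch below compares product weights of the form (in-degree + beta2)(R^lambda + alpha) with a deliberately simplified additive mixture, which is used for contrast only and is not the DSM rule; the three candidate nodes and all parameter values are chosen purely for illustration.

```python
def product_weights(k_in, R, beta2, alpha, lam):
    """Unnormalised product weight of each candidate ending node."""
    return [(k + beta2) * (r ** lam + alpha) for k, r in zip(k_in, R)]


def additive_weights(k_in, R, a):
    """Simplified additive mix of normalised degree and similarity."""
    k_tot, r_tot = sum(k_in), sum(R)
    return [a * k / k_tot + (1 - a) * r / r_tot for k, r in zip(k_in, R)]


# Candidates: popular but dissimilar, similar but unpopular, both.
k_in = [100, 1, 50]
R = [0.05, 0.9, 0.8]

product = product_weights(k_in, R, beta2=1, alpha=1e-12, lam=6)
additive = additive_weights(k_in, R, a=0.5)

product_share = [w / sum(product) for w in product]
additive_share = [w / sum(additive) for w in additive]
print("product: ", [round(s, 4) for s in product_share])
print("additive:", [round(s, 4) for s in additive_share])
```

Under the product rule almost all of the probability goes to the third candidate and the popular but dissimilar node is effectively excluded, whereas the additive rule still gives that node about a third of the probability.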



4. Evaluation of DSP Model 



For each of the two document networks, we generate ten networks using the DSP model 
with different random seeds. We avoid creating self-loops and duplicate links. The ten 
networks are grown to the same size as the target network. Results are then averaged 
over the ten networks. 

As shown in Table 1, the DSP model reproduces well the number of triangles and the
average clustering coefficient of the two document networks. Figure 2 and Figure 3 show
that the model also closely resembles the two networks' distribution of node in-degree,
linkage probability as a function of content similarity, clustering coefficient as a function
of node degree, and triangularity probability as a function of trilateral similarity. The
average clustering coefficient of nodes with in-degree $k$ (see Figure 3(a) and (b)) gives
details of a network's triangular clustering structure.

Table 2 gives the parameters used in the modelling. The values of the parameters
are tuned for best fitting. Our simulation shows that for both real networks, the
best modelling result is obtained when $\beta_1$ (in Equation (4)) and $\beta_2$ (in Equation (5)) take
different values. This suggests that node out-degree and in-degree have different weights
in choosing the starting and ending nodes of a link. The values of $\beta_1$ and $\beta_2$ for modelling
the WT10g data are smaller than those for the PNAS data. This suggests that a poorly
linked webpage has less difficulty in acquiring a new link in comparison with a poorly
cited article. A larger value of $\lambda$ is used for the PNAS data. This indicates that content
similarity plays a relatively stronger role than node connectivity in the growth of the
citation network.

Figure 2. Evaluation of the degree-similarity product (DSP) model against the
WT10g webpage network and the PNAS citation network: (a) and (b) distribution of
node in-degree; and (c) and (d) linkage probability as a function of (bilateral) content
similarity.

Figure 3. Evaluation of the DSP model against the two networks: (a) and (b) average
clustering coefficient of $k$-degree nodes; and (c) and (d) triangularity probability as a
function of trilateral (content) similarity.

5. Conclusion

It is known that document networks show a power-law degree distribution and a positive 
relation between the linkage probability and content similarity. In this paper, we show 
that document networks also contain very large numbers of triangles, high values of 
clustering coefficient, and a strong correlation between the triangularity probability and 
trilateral similarity. These three properties are not captured by the previous DSM model 
where a new node tends to link with an old node which is either popular or similar. 

Our intuition is that a link tends to attach between two documents which are
both popular and similar. We propose the degree-similarity product (DSP) model,
which captures this behaviour by using preferential attachment based on a product
function of node connectivity and content similarity. Our model reproduces all the
above topological and content properties with remarkable accuracy. Our work provides
a new insight into the structure and evolution of document networks and has the 
potential to facilitate the research on new applications and algorithms on document 
networks. Future work will mathematically analyse the DSP model, examine different 
types of triangles in document networks, and investigate the possible relation between 
the triangular clustering and the formation of communities in document networks. 